Here we will talk about an important piece of machine learning: the extraction of quantitative features from data. By the end of this section you will know how to extract useful numerical features from real-world data, including categorical and derived features.
In addition, we will go over several basic tools within scikit-learn which can be used to accomplish these tasks.
Recall that data in scikit-learn is expected to be in two-dimensional arrays, of size n_samples $\times$ n_features.
Previously, we looked at the iris dataset, which has 150 samples and 4 features.
In [ ]:
from sklearn.datasets import load_iris
iris = load_iris()
print(iris.data.shape)
These features are:
sepal length (cm)
sepal width (cm)
petal length (cm)
petal width (cm)
Numerical features such as these are pretty straightforward: each sample contains a list of floating-point numbers corresponding to the features
What if you have categorical features? For example, imagine there is data on the color of each iris:
color in [red, blue, purple]
You might be tempted to assign numbers to these features, e.g. red=1, blue=2, purple=3, but in general this is a bad idea. Estimators tend to operate under the assumption that numerical features lie on some continuous scale, so, for example, 1 and 2 are more alike than 1 and 3, and this is often not the case for categorical features.
A better strategy is to give each category its own dimension.
The enriched iris feature set would hence be in this case:
sepal length (cm)
sepal width (cm)
petal length (cm)
petal width (cm)
color=purple (1.0 or 0.0)
color=blue (1.0 or 0.0)
color=red (1.0 or 0.0)
Note that using many of these categorical features may result in data which is better represented as a sparse matrix, as we'll see with the text classification example below.
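As a minimal sketch of what this expansion looks like in practice (illustrative only, and assuming scikit-learn 0.20 or later, where OneHotEncoder accepts string categories directly):
In [ ]:
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Each distinct color becomes its own boolean column
colors = np.array([['red'], ['blue'], ['purple'], ['red']])
encoder = OneHotEncoder()
print(encoder.fit_transform(colors).toarray())
print(encoder.categories_)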
When the source data is encoded as a list of dicts where the values are either string names for categories or numerical values, you can use the DictVectorizer class to compute the boolean expansion of the categorical features while leaving the numerical features unimpacted:
In [ ]:
measurements = [
    {'city': 'Dubai', 'temperature': 33.},
    {'city': 'London', 'temperature': 12.},
    {'city': 'San Francisco', 'temperature': 18.},
]
In [ ]:
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer()
vec
In [ ]:
vec.fit_transform(measurements).toarray()
In [ ]:
vec.get_feature_names()  # on scikit-learn 1.0+, use vec.get_feature_names_out()
Another common feature type is derived features, where some pre-processing step is applied to the data to generate features that are somehow more informative. Derived features may be based on dimensionality reduction (such as PCA or manifold learning), may be linear or nonlinear combinations of features (such as in polynomial regression), or may be some more sophisticated transform of the features. The latter is often used in image processing.
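As a brief sketch of one such derived-feature transform, here is how polynomial combinations of two toy features can be generated with scikit-learn's PolynomialFeatures (this example is ours, purely for illustration):
In [ ]:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1., 2.],
              [3., 4.]])
# degree=2 expands [x1, x2] into [1, x1, x2, x1^2, x1*x2, x2^2]
poly = PolynomialFeatures(degree=2)
print(poly.fit_transform(X))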
In image processing, for example, scikit-image provides a variety of feature extractors designed for image data: see the skimage.feature submodule.
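A minimal sketch of what that might look like (assuming scikit-image is installed; hog and the sample camera image are part of skimage):
In [ ]:
from skimage import data
from skimage.feature import hog

# Compute a Histogram of Oriented Gradients descriptor for a sample grayscale image
image = data.camera()
features = hog(image, orientations=8, pixels_per_cell=(16, 16), cells_per_block=(1, 1))
print(features.shape)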
We will see some dimensionality-based feature extraction routines later in the tutorial.
As an example of how to work with both categorical and numerical data, we will perform survival prediction for the passengers of the RMS Titanic.
We will use a version of the Titanic (titanic3.xls) dataset from Thomas Cason, as retrieved from Frank Harrell's webpage here. We converted the .xls to .csv for easier manipulation without involving external libraries, but the data is otherwise unchanged.
In [ ]:
import numpy as np
from collections import defaultdict
import os
f = open(os.path.join('datasets', 'titanic', 'titanic3.csv'))
# Remove . from home.dest, split on quotes because some fields have commas
keys = f.readline().strip().replace('.', '').split('","')
lines = f.readlines()
f.close()
We read in all the lines from the titanic3.csv file and set aside the keys from the first line. Let's look at the keys and some corresponding example lines.
In [ ]:
print(keys)
print()
for i in range(3):
    print(lines[i])
    print()
The site linked here gives a broad description of the keys and what they mean; we show them here for completeness:
pclass     Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
survival   Survival (0 = No; 1 = Yes)
name       Name
sex        Sex
age        Age
sibsp      Number of Siblings/Spouses Aboard
parch      Number of Parents/Children Aboard
ticket     Ticket Number
fare       Passenger Fare
cabin      Cabin
embarked   Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
boat       Lifeboat
body       Body Identification Number
home.dest  Home/Destination
In general, it looks like name, sex, cabin, embarked, boat, body, and homedest may be candidates for categorical features, while the rest appear to be numerical features. We can now write a function to extract features from a text line, shown below.
In [ ]:
# Can also use pandas
def process_line(line):
    # Split on '",' so commas inside quoted fields don't break the parsing
    vals = line.strip().split('",')
    # Replace spurious " characters
    vals = [v.replace('"', '') for v in vals]
    pclass = int(vals[0])
    survived = int(vals[1])
    name = str(vals[2])
    sex = str(vals[3])
    try:
        age = float(vals[4])
    except ValueError:
        # Blank age
        age = -1
    sibsp = float(vals[5])
    parch = int(vals[6])
    ticket = str(vals[7])
    try:
        fare = float(vals[8])
    except ValueError:
        # Blank fare
        fare = -1
    cabin = str(vals[9])
    embarked = str(vals[10])
    boat = str(vals[11])
    homedest = str(vals[12])
    line_dict = {'pclass': pclass, 'survived': survived, 'name': name, 'sex': sex, 'age': age, 'sibsp': sibsp,
                 'parch': parch, 'ticket': ticket, 'fare': fare, 'cabin': cabin, 'embarked': embarked,
                 'boat': boat, 'homedest': homedest}
    return line_dict
Let's process an example line using this function to see the expected output.
In [ ]:
print(process_line(lines[0]))
Now we seem to have extracted relevant features and found a representation that will eventually work with DictVectorizer. However, first we need to break the dataset into six pieces: training categorical, training numeric, testing categorical, testing numeric, training labels, and testing labels.
In [ ]:
string_keys = ['name', 'sex', 'ticket', 'cabin', 'embarked', 'boat', 'homedest']
numeric_keys = ['pclass', 'survived', 'age', 'sibsp', 'parch', 'fare']
train_vectorizer_list = []
test_vectorizer_list = []
n_samples = len(lines)
numeric_data = np.zeros((n_samples, len(numeric_keys) - 1))
numeric_labels = np.zeros((n_samples,), dtype=int)
random_state = np.random.RandomState(1999)
indices = np.arange(len(lines))
random_state.shuffle(indices)
# 1309 total samples - use random 1100 to train and 209 to test
train_idx = indices[:1100]
test_idx = indices[1100:]
for n, l in enumerate(lines):
    line_dict = process_line(l)
    strings = {k: line_dict[k] for k in string_keys}
    if n in train_idx:
        train_vectorizer_list.append(strings)
    else:
        test_vectorizer_list.append(strings)
    numeric_data[n] = np.asarray([line_dict[k] for k in numeric_keys if k != "survived"])
    numeric_labels[n] = line_dict["survived"]
train_numeric = numeric_data[train_idx]
test_numeric = numeric_data[test_idx]
train_labels = numeric_labels[train_idx]
test_labels = numeric_labels[test_idx]
vec = DictVectorizer()
# .toarray() because fit_transform returns a scipy sparse matrix
train_categorical = vec.fit_transform(train_vectorizer_list).toarray()
test_categorical = vec.transform(test_vectorizer_list).toarray()
Now we can combine everything into training and testing datasets by combining the categorical and numeric data, concatenating along the feature axis.
In [ ]:
train_data = np.concatenate([train_numeric, train_categorical], axis=1)
test_data = np.concatenate([test_numeric, test_categorical], axis=1)
With all of the hard data work out of the way, evaluating a classifier on this data becomes straightforward. Using a RandomForestClassifier, we see decent performance: about 70% accuracy in predicting who survives the Titanic disaster. With random chance being 50%, this is a reasonable result, though there are many refinements which could be made.
In [ ]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
clf = RandomForestClassifier(n_estimators=100)
clf.fit(train_data, train_labels)
pred_labels = clf.predict(test_data)
print("Prediction accuracy: %f" % accuracy_score(pred_labels, test_labels))
Can you remove or create new features to improve your score? Try plotting feature importance as shown in this sklearn example, or removing features like name, body, and homedest. Can you fill in missing data (see the try/except pieces of process_line) and improve your score further?
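As a starting point, here is a minimal sketch of how one might rank the learned feature importances. It relies only on objects defined above (clf, vec, numeric_keys); the feature_names list is our own construction and simply mirrors the column order used when building train_data:
In [ ]:
# Rank features by importance; column order matches np.concatenate([train_numeric, train_categorical], axis=1)
feature_names = [k for k in numeric_keys if k != "survived"] + list(vec.get_feature_names())
importances = clf.feature_importances_
order = np.argsort(importances)[::-1]
for idx in order[:10]:
    print("%s: %.4f" % (feature_names[idx], importances[idx]))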
In [ ]: